Acoustic Modeling of Subword Units for Large Vocabulary Speaker Independent Speech Recognition
Abstract
Large vocabulary continuous speech recognition has advanced to the point where several systems attain between 90 and 95% word accuracy for speaker independent recognition of a 1000 word vocabulary, spoken fluently, for a task with a perplexity (average word branching factor) of about 60. Several factors account for the high performance of these systems: the use of hidden Markov models (HMMs) for acoustic modeling, the use of context dependent sub-word units, the representation of between-word phonemic variation, and the use of corrective training techniques to emphasize differences between acoustically similar words in the vocabulary. In this paper we describe one of the large vocabulary speech recognition systems being developed at AT&T Bell Laboratories and discuss the methods used to provide high word recognition accuracy. In particular, we focus on the techniques used to obtain acoustic models of the sub-word units (both context independent and context dependent), and discuss the resulting system performance as a function of the type of acoustic modeling used.

INTRODUCTION

In the past few years, a number of systems for large vocabulary speech recognition have been proposed which achieve high word recognition accuracy [1-6]. Although a couple of these systems have concentrated on isolated word input [6] or have been trained to individual speakers [5, 6], most current large vocabulary recognition systems aim to recognize fluent input (continuous speech) from any talker (speaker independent systems).

The approach to large vocabulary speech recognition we adopt in this study is a pattern recognition based approach. For a detailed description of the system we have developed, the reader is referred to [7]. The basic speech units in the system are modeled acoustically based on a lexical description of words in the vocabulary.
No assumption is made, a priori, about the mapping between acoustic measurements and phonemes; such a mapping is learned entirely from a finite training set of utterances. The resulting speech units, which we call phone-like units (PLUs), are essentially acoustic descriptions of linguistically based units as represented in the words occurring in the given training set.

The focus of this paper is a discussion of various methods used to create a set of acoustic models for characterizing the PLUs used in large vocabulary recognition (LVR). The set of context independent (CI) units we used in this study is a fixed set of 47 PLUs, in which each PLU is associated with a linguistically defined phoneme symbol. We model each CI PLU using a continuous density hidden Markov model (CDHMM) with a Gaussian mixture state observation density. Each word model is defined as the concatenation of the PLU models according to a fixed lexicon defined by the set of 47 associated phoneme symbols. We also consider a set of context dependent (CD) units, which includes PLUs defined by left context, right context, and both left and right context.

† On leave from CSELT, Torino, Italy.
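
As a rough illustration of the unit definitions in the introduction, the sketch below shows how a word could be mapped to a sequence of CI units through a lexicon, how that sequence might be relabeled with left and right context, and how a Gaussian mixture state observation density could be evaluated. Everything named here is an assumption made for illustration: the toy lexicon and its phoneme symbols, the `left-center+right` labeling with `$` as a word-boundary marker, and the diagonal-covariance form of the mixture. None of it is taken from the system described in the paper.

```python
import math

# Hypothetical toy lexicon: word -> sequence of context independent (CI) unit symbols.
# The symbols and pronunciations are illustrative, not the paper's 47-unit set.
LEXICON = {
    "six": ["s", "ih", "k", "s"],
    "two": ["t", "uw"],
}


def word_to_ci_units(word, lexicon=LEXICON):
    """Return the CI unit sequence whose models would be concatenated for this word."""
    return list(lexicon[word])


def word_to_cd_units(word, lexicon=LEXICON, boundary="$"):
    """Relabel each unit with its left and right neighbors (left-center+right).

    `boundary` marks the word edge; this labeling scheme is an assumed convention.
    """
    padded = [boundary] + list(lexicon[word]) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def log_gmm_density(x, weights, means, variances):
    """Log density of a Gaussian mixture at vector x (diagonal covariances assumed)."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log of one diagonal Gaussian: sum of independent per-dimension terms.
        log_gauss = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over mixture components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

In this sketch, `word_to_cd_units("two")` yields one label per unit in the word, so the word model keeps the same length as in the CI case and only the identity of each unit model changes.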
Similar resources
Phone transition acoustic modeling: application to speaker independent and spontaneous speech systems
HMM-based large vocabulary speech recognition systems usually have a very large number of statistical parameters. For better estimation, the number of parameters is reduced by sharing them across models. The parameter sharing is decided by regression trees which are built using phonetic classes designed either by a human expert or by data-driven methods. In situations where neither of these are...
Recent Progress in Robust Vocabulary-Independent Speech Recognition
This paper reports recent efforts to improve the performance of CMU's robust vocabulary-independent (VI) speech recognition systems on the DARPA speaker-independent resource management task. The improvements are evaluated on 320 sentences that were randomly selected from the DARPA June 88, February 89 and October 89 test sets. Our first improvement involves more detailed acoustic modeling. We incorp...
Speech Recognition Using Demi-Syllable Neural Prediction Model
The Neural Prediction Model is a speech recognition model based on pattern prediction by multilayer perceptrons. Its effectiveness was confirmed by speaker-independent digit recognition experiments. This paper presents an improvement in the model and its application to large vocabulary speech recognition, based on subword units. The improvement involves an introduction of "backward predic...
Combined Optimisation of Baseforms and Model Parameters in Speech Recognition Based on Acoustic Subword Units
A major challenge in speech recognition is creating a lexicon which is robust to inter- and intra-speaker variations. This is even more so in speech recognisers based on non-linguistic units, e.g., acoustic subword units (ASWUs), since no standard pronunciation dictionaries are available. Thus the baseforms describing the vocabulary words in terms of the recognition units need to be generated fr...
Are Initial/Final Units Acoustically Accurate?
We show a comparative study of subword unit segmentation of Mandarin speech data. Most HMM recognition systems use initial/finals as subword units for Mandarin speech. We find that such a division of monosyllable data into initial/final units is not always supported by acoustic evidence. We implement a delta MFCC based segmentation method and compare its output with that of Viterbi segmentation...